Introduction

In this study we investigate the relationship between several neighborhood characteristics and median home value across U.S. Census block groups in Philadelphia, PA. Home value is a metric of interest to many fields. Current and prospective homeowners and the private sector have an obvious interest in understanding which neighborhood characteristics influence house prices, but justice-oriented organizations and local governments do as well. For example, many local government budgets rely on property taxes, a rate they have a hand in setting. Understanding which factors may lead to greater home prices, and thus greater property tax revenue for the city, can help local governments plan their budgets as well as programs that help neighborhoods grow.

Here we examine the effect that education, poverty, housing form, and vacancy have on house prices across Philadelphia. We have some prior expectations about our variables. Common sense suggests that if neighborhood (or block group) poverty is high, house prices in that area are unlikely to be high. Additionally, given the exorbitant cost of higher education in the United States and the fact that those with college degrees generally earn more, a more educated block group is also likely to have higher house prices. We would like to examine whether any of these variables are statistically significant predictors, and that is what we plan to do. The rest of this report describes our study design, results, and discussion.

Methods

a. Data Cleaning

Our study relies on census data that include observations for various demographic variables at the census block group level. We began with a data set of 1,816 observations and filtered out block groups with small populations (<40), block groups with no housing units, and block groups with a median house value less than $10,000. Additionally, two block groups in North Philadelphia were removed because they combined a very high median house value (greater than $80,000) with a very low median household income (less than $8,000). We were left with 1,720 observations to use in our analysis.

b. Exploratory Analysis

The first step we took in analyzing our data set was to calculate summary statistics. These statistics include the mean and standard deviation of both our dependent variable, Median House Value (MEDHVAL), and our four independent predictors:

  • Number of Households Living Below the Poverty Line (NBELPOV)

  • Percent of housing units that are detached single family houses (PCTSINGLES)

  • Percent of housing units that are vacant (PCTVACANT)

  • Percent of residents in Block Group with at least a bachelor’s degree (PCTBACHMOR)

These statistics give us a cursory understanding of what the values of our predictors and dependent variable look like. Additionally, we explored the distributions of our data using histograms. A histogram of each variable’s values lets us judge whether it is approximately normally distributed. This matters for choosing a regression model: different regression models carry different assumptions, and some require that the residuals, i.e., the estimated errors of the observations in our sample, be normally distributed.

It is possible for a non-normally distributed variable to have normally distributed residuals, but if the variable is not normally distributed, its residuals are much more likely not to be either. The Central Limit Theorem suggests that with enough observations, the normality of a variable or its residuals matters less. In any case, we are interested in exploring whether our variables are normally distributed so that, if they are not, we can transform them and preserve the assumptions of the regression we use. We hope to build an interpretable model, and with normally distributed variables, and by extension residuals, our model will be easier to interpret.

Correlation is a standardized measure of how strong the linear relationship is between two variables, usually denoted r. The value of r ranges over \(-1 \leq r \leq 1\). A value of 1 or -1 implies a strong correlation; the relationship is negative at -1 and positive at 1. A value of 0 implies there is no linear relationship at all. Two variables with r = 0 may still be strongly related, but not linearly, as r captures only linear relationships.

For our data we used Pearson’s Correlation equation:

\[r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2}\sqrt{\sum_{i=1}^{n} (y_i - \overline{y})^2}}\]
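The formula above can be sketched directly in Python; `pearson_r` is a hypothetical helper written for illustration (in practice a library routine such as `numpy.corrcoef` would be used):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: the sum of products of deviations from the
    means, divided by the product of the root sums of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# A perfectly linear, decreasing relationship gives r = -1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([8.0, 6.0, 4.0, 2.0])
print(pearson_r(x, y))  # → -1.0
```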

Typically, if any two predictors have a correlation of \(r > 0.8\) or \(r < -0.8\), we might need to remove one of them from the model due to multicollinearity. Multicollinearity occurs when two predictors are very strongly correlated with each other. If both were incorporated into the same model, there would be little benefit and our coefficient estimates could become less reliable. Since we are concerned with the interpretability of our model, we intend to avoid multicollinearity.

c. Multiple Regression Analysis

We will use multiple Ordinary Least Squares (OLS) regression to examine the relationship between our variable of interest and our explanatory variables. OLS multiple regression determines the strength and direction (positive, negative, or zero) of each relationship, as well as goodness of model fit. We estimate the coefficient \(\beta_i\) of each predictor, interpreted as the amount by which the dependent variable, in our case median house value, changes as that independent variable increases by one unit, holding all other predictors constant. The model includes an error term \(\epsilon\), defined for each observation i as the vertical distance between the observed and predicted values of y; it is included in the equation to allow a point to fall either above or below the regression line. We are regressing median house value on the (log-transformed) number of households living below the poverty line, the percentage of individuals with a bachelor’s degree or higher, the percentage of vacant housing units, and the percentage of single house units in Philadelphia. Our regression equation is as follows:

\[y = MEDHVAL = \beta_{0} + \beta_{1}LNBELPOV100 + \beta_{2}PCTBACHMOR + \beta_{3}PCTVACANT + \beta_{4}PCTSINGLES + \epsilon\]

OLS multiple regression carries several assumptions. The first is a linear relationship between each x and y, which we examined via our scatter plots in Figure 3.

The second assumption is normality of residuals. Because of the Central Limit Theorem, this matters mainly for point estimation, confidence intervals, and hypothesis tests in small samples; however, normality is essential for all sample sizes in order to predict future values of y. If residuals appear to have a non-normal distribution, it may indicate a non-linear relationship, or non-normal distributions of the dependent or independent variables. This may be addressed with a logarithmic transformation of the variables.

The third and fourth assumptions are that residuals are random and homoscedastic. There should be no relationship or pattern between the residuals and the predicted values, and homoscedasticity means the variance of residuals should look constant across predicted values when plotted.

The fifth assumption states that observations, and hence residuals, must be independent. If data have a temporal or spatial component, residuals of observations that are close in time or space will be autocorrelated, or dependent on one another. In that case, time series or spatial regression should be used instead of OLS.

The final assumption of multiple OLS regression is the absence of multicollinearity. Said differently, the predictors should not be strongly correlated with one another. The presence of multicollinearity causes unstable coefficient estimates and weakens the model.

The parameters of multiple regression are the coefficients \(\beta_0, \ldots, \beta_k\), where k is the number of predictors. \(\beta_0\) represents the y-intercept of the regression line, and \(\beta_1, \ldots, \beta_k\) represent the coefficients of the variables \(x_1, \ldots, x_k\). Each independent variable has its own slope coefficient, which indicates the relationship of that predictor with the dependent variable, controlling for all other independent variables in the regression.
OLS regression minimizes the sum of squared residuals. The least squares estimates for \(\beta_0, \ldots, \beta_k\) are obtained when the quantity SSE (equation below) is minimized.

\[ SSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_{i}- \hat{\beta}_{0}- \hat{\beta}_{1}x_{1i}- \hat{\beta}_{2}x_{2i}- \ldots- \hat{\beta}_{k}x_{ki})^2 \]

The variance is the other parameter that needs to be estimated in multiple regression, calculated as \(\hat{\sigma}^2=\frac{SSE}{n-(k+1)}=MSE\), where k is the number of predictors, n is the number of observations, and MSE stands for Mean Squared Error.

In multiple regression, R² is the coefficient of multiple determination, the proportion of variance in the dependent variable explained by all k predictors, given by \(R^2=1-\frac{SSE}{SST}\). R² increases as predictors are added to the model, and can be adjusted for the number of predictors with the equation \(R_{adj}^2=\frac{(n-1) R^2-k}{n-(k+1)}\).

The model utility test, or F-ratio, is conducted on the regression model as a goodness-of-fit measure; it can be interpreted as a significance test for R². The F-ratio tests the null hypothesis \(H_0\) that all coefficients in the model are jointly zero against the alternative hypothesis \(H_a\) that at least one of the coefficients is not 0. Said differently, rejecting the null indicates that at least one of the independent variables is a significant predictor of the dependent variable. For the F-ratio we look for a p-value less than 0.05. Once the F-ratio is determined, we run a t-test on every individual predictor. The null hypothesis states that the predictor (e.g., percentage of vacant homes) has no association with the dependent variable, which in our case is median house value. Our goal is to reject the null hypothesis \(H_0: \beta_i = 0\) in favor of the alternative hypothesis \(H_a: \beta_i \neq 0\) for each predictor.
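The quantities above fit together in a few lines of code. The sketch below is our own illustrative helper (`ols_summary` is a hypothetical name, not the software used to produce our results); it fits OLS via least squares and computes SSE, MSE, R², adjusted R², and the F-ratio:

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS by minimizing SSE; return coefficients and fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # least squares estimates
    resid = y - Xd @ beta
    sse = float((resid ** 2).sum())                 # sum of squared errors
    sst = float(((y - y.mean()) ** 2).sum())        # total sum of squares
    mse = sse / (n - (k + 1))                       # estimate of sigma^2
    r2 = 1 - sse / sst
    adj_r2 = ((n - 1) * r2 - k) / (n - (k + 1))
    f_ratio = ((sst - sse) / k) / mse               # model utility test
    return beta, {"SSE": sse, "MSE": mse, "R2": r2,
                  "adjR2": adj_r2, "F": f_ratio}
```

The p-value for the F-ratio would then come from the F distribution with k and n − (k + 1) degrees of freedom.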

d. Additional Analysis

One of the additional methods we use is stepwise regression: the step-by-step, iterative construction of a regression model that selects which independent variables to retain in a final model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration. The main drawbacks of stepwise multiple regression include bias in parameter estimation, inconsistencies among model selection algorithms, an inherent (but often overlooked) problem of multiple hypothesis testing, and an inappropriate focus on, or reliance upon, a single best model. Another, less emphasized limitation of stepwise regression is that stepwise estimates are not invariant to inconsequential linear transformations, such as the rescaling used to compute standardized coefficients:
\[ b_{j,std} = b_{j}\left(\frac{s_{x_j}}{s_y}\right) \]

where \(s_y\) and \(s_{x_j}\) are the standard deviations of the dependent variable and the corresponding jth independent variable, respectively.
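As a minimal sketch of that rescaling (assuming numpy; `standardized_betas` is a hypothetical helper, not part of any particular library), each raw slope is multiplied by the ratio of its predictor’s standard deviation to the dependent variable’s standard deviation:

```python
import numpy as np

def standardized_betas(b, X, y):
    """b_j_std = b_j * (s_xj / s_y): rescale raw OLS slopes so they are
    comparable across predictors measured on different scales."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # sample standard deviations (ddof=1) per column of X and for y
    return np.asarray(b, dtype=float) * X.std(axis=0, ddof=1) / y.std(ddof=1)
```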

Cross-validation is a resampling technique used to evaluate predictive models on a limited data sample. The technique has a single parameter, k, that refers to the number of groups a given data sample is to be split into; as such, the procedure is often called k-fold cross-validation. When a particular value for k is chosen, it can be substituted into the name of the method, e.g., k = 5 becomes 5-fold cross-validation. Cross-validation is commonly used in applied machine learning to estimate the skill of a model on unseen data, that is, to use a limited sample to estimate how the model is expected to perform when making predictions on data not used during training. The method is popular because it is straightforward to understand and because it yields a less biased, or less optimistic, estimate of model skill than other methods.

\[ CV(\lambda)= \frac{1}{k}\sum_{j=1}^{k} E_{j}(\lambda) \]

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far data points fall from the regression line; RMSE measures how spread out those residuals are. In other words, it tells us how concentrated the data are around the line of best fit. RMSE is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
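The procedure can be sketched as follows. This is a minimal illustration of k-fold cross-validated RMSE for an OLS model (assuming numpy; `kfold_cv_rmse` is a hypothetical helper, not the exact code behind our reported errors):

```python
import numpy as np

def kfold_cv_rmse(X, y, k=5, seed=0):
    """Split the data into k folds, fit OLS on k-1 folds, predict the
    held-out fold, and pool the squared prediction errors into one RMSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)  # shuffle observations
    folds = np.array_split(idx, k)
    sq_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        Xtr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        Xte = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)
        sq_errors.extend((y[test_idx] - Xte @ beta) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))
```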

Results

a. Exploratory Results

To gain a better understanding of our observations we calculated summary statistics and examined the distribution of the variables. Below, Table 1 shows the summary statistics calculated for our dependent variable, Median House Value (MEDHVAL) and our four independent variables:

  • Number of Households Living Below the Poverty Line (NBELPOV)

  • Percent of housing units that are detached single family houses (PCTSINGLES)

  • Percent of housing units that are vacant (PCTVACANT)

  • Percent of residents in Block Group with at least a bachelor’s degree (PCTBACHMOR)

All of our variables have a standard deviation close to or greater than their mean, indicating that values are widely dispersed relative to their typical magnitude. This is not itself a problem, but it does show considerable variability across block groups, which, given the nature of our data, is expected.

The distributions of our raw variables are shown in Figure 1. We thought it wise to also illustrate the distributions of the natural logs of our variables, as shown in Figure 2. From these visualizations, it was clear that both the dependent variable and NBELPOV were not normally distributed, and as a result were likely poor fits for linear regression in raw form. A linear regression model assumes that the residuals, the estimated errors of the observations in our sample, are normally distributed. And while it is possible for a variable with a non-normal distribution to have normally distributed residuals, it is more likely that variables with non-normal distributions also have non-normally distributed residuals. If the residuals of our variables were not normally distributed, we would be in violation of an assumption of linear regression. Though a case can be made that we meet the conditions of the Central Limit Theorem, given the number of observations we have, we decided to address the non-normally distributed variables so that we can confidently perform statistical hypothesis testing with our model.

With this motivation, we concluded that it would be best to use the natural log of the dependent variable and of NBELPOV. We chose the natural log (and the natural log of x + 1 where observations contained zeros) because those two variables were positively skewed. We used the raw versions of the remaining variables. We make further regression assumption checks later in this document.
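The transformations just described can be sketched as below (the values shown are hypothetical, purely for illustration):

```python
import numpy as np

# Natural log for the strictly positive, skewed dependent variable,
# and log(x + 1) for the poverty count, which contains zeros.
medhval = np.array([45000.0, 62000.0, 310000.0])  # hypothetical house values
nbelpov = np.array([0.0, 12.0, 150.0])            # hypothetical counts, one zero

ln_medhval = np.log(medhval)       # LNMEDHVAL-style transform
ln_belpov = np.log(nbelpov + 1.0)  # log(x + 1): zeros map cleanly to 0
```

The + 1 shift matters only for the count variable; taking `np.log(0)` would produce `-inf` and break the regression fit.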

Variable                                              Mean        SD
Median House Value (MEDHVAL)                      66287.73  60006.08
% of Individuals with Bachelor’s Degrees or Higher   16.08     17.77
% of Vacant Houses                                   11.29      9.63
% of Single House Units                               9.23     13.25
# of Households Living Below the Poverty Line       189.77    164.32
Summary Statistics
Table 1

We next test the relationships between our predictor variables using a correlation measure to make sure our model is free of multicollinearity. In Table 2 below, you’ll notice that each predictor’s correlation with itself is 1, but none of our predictor variables has a correlation with another greater than 0.8 or less than -0.8, so our model will be free from multicollinearity. The other regression assumptions will be explored in the Regression Assumption Checks section below.

Next in our exploration, we looked at the distribution of our variables across space. Below are a series of maps that show these distributions, with values classed by Jenks natural breaks. Map 1 shows how our dependent variable, LN Median House Value, differs across block groups, and Map 2 shows our predictor variables. Most maps use 5 breaks, but the values for LN %BELPOV were not varied enough, so we included only three breaks.

The map of the percentage of vacant houses looks like the negative of the map of our dependent variable, the LN of median house value. Logically, that makes sense: where houses are vacant, values are low, a clear relationship. This strong relationship will not be a problem for our model, since it is between the dependent variable and a predictor.

On the other hand, the map of the percentage of individuals with a bachelor’s degree or higher looks the most like our dependent variable, while the percentage of households living in poverty looks almost like its opposite. It also appears that where there are people with at least a bachelor’s degree, there are also single house units and house values are higher. Because of this, it may be the case that our bachelor’s variable is somewhat, if not strongly, negatively correlated with our poverty variable. This agrees with the ideas we started this analysis with, i.e., the fewer means you have, the less likely you are to complete college and the less likely you are to own an expensive house. These are of course generalizations, and we must note that our model cannot predict the characteristics of any one person’s life; the most it can do is illuminate trends. Our next step is to identify whether the relationship between predictors such as our poverty and bachelor’s variables poses an issue for the assumptions of our model. More on this below.

             LNMEDHVAL  LNBELPOV100  PCTBACHMOR  PCTSINGLES  PCTVACANT
LNMEDHVAL       1.0000      -0.4241      0.7357      0.2654    -0.5143
LNBELPOV100    -0.4241       1.0000     -0.3198     -0.2905     0.2495
PCTBACHMOR      0.7357      -0.3198      1.0000      0.1975    -0.2984
PCTSINGLES      0.2654      -0.2905      0.1975      1.0000    -0.1514
PCTVACANT      -0.5143       0.2495     -0.2984     -0.1514     1.0000
Pearson’s Correlation of Predictors
Table 2

In our correlation matrix, shown in Table 2, the strongest correlation is in fact between the percentage of the population with a bachelor’s degree or more and our dependent variable, LNMEDHVAL. That is a relationship we spotted when comparing the maps of those two variables, and one that is welcome in our model, as is the next strongest relationship, between the dependent variable and the percentage of vacant houses. The relationship we spotted between poverty and education in the maps is among the strongest between predictors, but with an r of -0.3198 we will not be in violation of any assumptions of our regression. In fact, no pair of variables produces an r greater than 0.8 or less than -0.8, so we are clear to continue using every variable.

b. Regression Results

After exploring our data and transforming variables as needed, the final regression equation is as follows:

\[y = LNMEDHVAL = \beta_{0} + \beta_{1}LNBELPOV100 + \beta_{2}PCTBACHMOR + \beta_{3}PCTVACANT + \beta_{4}PCTSINGLES + \epsilon\]

We regressed the natural log of median house value (LNMEDHVAL) on the natural log of the number of households living below the poverty line (LNBELPOV100), the percent of individuals with Bachelor’s Degrees or higher (PCTBACHMOR), the percent of vacant houses (PCTVACANT), and the percent of single house units (PCTSINGLES). The regression output tells us that all of these variables are highly significant predictors of median house value (p < 0.0001 for each). The coefficients take a modified interpretation because we applied a logarithmic transformation to the dependent variable. For the log-log term LNBELPOV100, a 1% increase in the number of households in poverty corresponds to a \((1.01^{\beta_1} - 1) \times 100 = (1.01^{-0.0789} - 1) \times 100 \approx -0.079\%\) change in median house value, i.e., a 0.079% decrease. For the remaining, untransformed predictors, a one-point increase multiplies median house value by \(e^{\beta}\): a one-point increase in the percent of individuals with at least a Bachelor’s degree is associated with a \((e^{0.0209} - 1) \times 100 \approx 2.11\%\) increase in median house value; a one-point increase in the percent of vacant homes with a \((e^{-0.0192} - 1) \times 100 \approx 1.90\%\) decrease; and a one-point increase in the percent of single house units with a \((e^{0.0030} - 1) \times 100 \approx 0.30\%\) increase.
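These coefficient readings can be checked numerically. The sketch below applies the standard semi-log conventions (a log-log slope read as \((1.01^{\beta} - 1) \times 100\) percent per 1% change in x, a level predictor’s slope as \((e^{\beta} - 1) \times 100\) percent per one-unit change) to the fitted coefficients reported in Table 3:

```python
import math

# Log-log term: percent change in house value per 1% change in poverty count
b_lnbelpov = -0.0789054
print((1.01 ** b_lnbelpov - 1) * 100)  # ≈ -0.0785 (about a 0.079% decrease)

# Log-level terms: percent change per one-point increase in the predictor
for name, b in [("PCTBACHMOR", 0.0209098),
                ("PCTVACANT", -0.0191569),
                ("PCTSINGLES", 0.0029769)]:
    print(name, (math.exp(b) - 1) * 100)
```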

The p-value of less than 0.0001 for LNBELPOV100 tells us that if there were actually no relationship between LNBELPOV100 and the dependent variable LNMEDHVAL (i.e., if the null hypothesis that \(\beta_1 =0\) were actually true), then the probability of obtaining a \(\beta_1\) coefficient estimate at least as extreme as -0.0789054 would be less than 0.0001. Similarly, the p-value of less than 0.0001 for PCTBACHMOR tells us that if there were actually no relationship between PCTBACHMOR and LNMEDHVAL (i.e., if the null hypothesis that \(\beta_2=0\) were actually true), then the probability of obtaining a \(\beta_2\) coefficient estimate at least as extreme as 0.02091 would be less than 0.0001. The same interpretation applies to the p-values of PCTVACANT and PCTSINGLES; both of these predictors are statistically significant with very low p-values < 0.0001. These low probabilities indicate that we can safely reject
\(H_0: \beta_1 = 0\) for \(H_a: \beta_1 ≠ 0\),
\(H_0: \beta_2 = 0\) for \(H_a: \beta_2 ≠ 0\),
\(H_0: \beta_3 = 0\) for \(H_a: \beta_3 ≠ 0\),
\(H_0: \beta_4 = 0\) for \(H_a: \beta_4 ≠ 0\)
(at most reasonable levels of α = P(Type I error)).

Over half of the variance in the dependent variable is explained by the model (R² and adjusted R² are 0.6623 and 0.6615, respectively). The low p-value associated with the F-ratio shows that we can reject the null hypothesis that all coefficients in the model are jointly 0.

term         estimate    std.error  statistic    p.value
(Intercept)  11.1137661  0.0465330  238.836351   0.0000000
LNBELPOV100  -0.0789054  0.0084569   -9.330279   0.0000000
PCTBACHMOR    0.0209098  0.0005432   38.493943   0.0000000
PCTVACANT    -0.0191569  0.0009779  -19.590280   0.0000000
PCTSINGLES    0.0029769  0.0007032    4.233544   0.0000242
Regression Summary
Table 3

r.squared  adj.r.squared  sigma      f.statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.662297   0.6615093      0.3664848  840.8567     0        4   -711.5376  1435.075  1467.776  230.3435  1715         1720
Model Fit Statistics
Table 4

term         df    sumsq     meansq    statistic  p.value
LNBELPOV100  1     122.6594  122.6594   913.2481  0
PCTBACHMOR   1     273.6043  273.6043  2037.0939  0
PCTVACANT    1      53.0746   53.0746   395.1618  0
PCTSINGLES   1       2.4072    2.4072    17.9229  0
Residuals    1715  230.3435    0.1343   NA        NA
ANOVA
Table 5
c. Regression Assumption Checks

This section tests model assumptions and aptness. The histograms of the variable distributions were presented earlier, in the exploratory results section. Here, we check whether the assumption of a linear relationship between our dependent variable, LNMEDHVAL, and each of its predictors holds. The scatter plots in Figure 3 display the relationship between each predictor and the natural log of median house value. As the scatter plots show, PCTBACHMOR and LNMEDHVAL appear to have what most resembles a linear relationship; the other scatter plots do not appear exactly linear. While this does not fully meet our first assumption, there is no strong polynomial relationship, and for our case a linear model is a reasonable fit through these observations.

The second assumption of OLS regression is the normality of residuals. Although this assumption is less critical than some others, especially when working with a large sample, it is still worth checking. Figure 4 presents a histogram of our standardized residuals. A standardized residual is simply a residual divided by its standard error; the residuals have been standardized for these assumption checks so that residuals for different observations can be compared to each other. The distribution appears to be normal, which may be attributed to the normality of our log-transformed dependent variable.
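A minimal sketch of the standardization just described (assuming numpy; `standardized_residuals` is a hypothetical helper, dividing raw residuals by the residual standard error):

```python
import numpy as np

def standardized_residuals(y, y_hat, k):
    """Raw residuals divided by the residual standard error, where
    k = number of predictors and the standard error uses n - (k + 1)
    degrees of freedom, matching the MSE formula in the Methods section."""
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    n = len(resid)
    se = np.sqrt((resid ** 2).sum() / (n - (k + 1)))
    return resid / se
```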

The third assumption we check for in our regression model is homoscedasticity of residuals. Homoscedasticity means the variance of the residuals is constant regardless of the values of each x, which bears on the accuracy of model predictions. Figure 5 displays a scatter plot of the standardized residuals (raw residuals divided by the standard error of estimate) against the predicted median house values from our model. Our model appears to meet the assumption of homoscedasticity because the variance of the residuals looks largely constant, with no pattern that would indicate systematic over- or under-prediction. Using standardized residuals while checking for homoscedasticity will also reveal outlier observations in the data set.

We next consider the assumption that our data are not spatially autocorrelated. Earlier in the report, under the exploratory results section, we mapped each predictor by block group to see how each relates to itself in space. All of the predictors, according to the choropleth maps, show clustering in space. This violates our regression model’s assumption of independence. This is not surprising: in alignment with Tobler’s first law of geography, nearer things are more related than distant things. For example, a census block group adjacent to a block group with a high percentage of vacant homes is more likely than not to also have a high percentage of vacant homes. The same applies to our other predictors; wealth and poverty, and their associated attributes, cluster. Map 3 presents a choropleth map of the standardized residuals by census block group. Here we observe a less obvious pattern of clustering: while the north and center of Philadelphia show some clustering of negative residual values, the standardized residuals appear to have a more random spatial pattern overall, as compared to the predictor choropleth maps previously discussed.

d. Additional Models
Step  Df  Deviance  Resid. Df  Resid. Dev  AIC
      NA  NA        1715       230         -3448
Stepwise ANOVA
Table 6

term         estimate  std.error  statistic  p.value
(Intercept)  10.430    0.033      315.9      0
MEDHHINC      0.000    0.000       29.0      0
PCTVACANT    -0.019    0.001      -15.2      0
Regression Summary
Table 7

r.squared  adj.r.squared  sigma  f.statistic  p.value  df  logLik  AIC   BIC   deviance  df.residual  nobs
0.507      0.506          0.443  882          0        2   -1037   2082  2104  336       1717         1720
Model Fit Statistics
Table 8

term       df    sumsq  meansq   statistic  p.value
MEDHHINC   1     300.2  300.215  1532       0
PCTVACANT  1      45.5   45.488   232       0
Residuals  1717  336.4    0.196  NA         NA
ANOVA
Table 9

model                   mse    rmse
regression.1 (4 vars)   0.134  0.366
regression.2 (2 vars)   0.196  0.443
Errors For Both Regressions
Table 10

The stepwise procedure kept all of the predictors, which means the original model is the best model, since it has the lowest AIC. The RMSE for the 4-predictor model is 0.366 and the RMSE for the 2-predictor model (percent vacant and median household income) is 0.443. Since the 4-predictor RMSE is lower than the 2-predictor RMSE, the original model is the better one.

Limitations & Conclusion

In this study we set out to investigate to what extent median house value is a function of various neighborhood characteristics, and which characteristics were more powerful than others. In our first model, which used four predictors, we saw that education was the greatest factor, more specifically the percentage of people in a block group with at least a bachelor’s degree. This variable alone had an r of 0.7357 with median house value, meaning the two were positively correlated: as the percentage of people with at least a bachelor’s degree increases, so does house value. This is not very surprising given the ideas we laid out in the introduction. Higher education in the United States often requires substantial financial resources, and those with college degrees typically earn higher incomes. Thus, it is not surprising that where there are college graduates, there are also higher home values.

Our first model, which uses the natural log of the number of households living below the poverty line, the percentage of people with at least a bachelor’s degree, the percentage of vacant housing units, and the percentage of single house units, does yield the best model given our predictors. When investigating the influence our predictors had on median house value, we saw that our F-statistic was associated with a very low p-value (< 0.0000000000000002), meaning that our model held promise and that at least one of our predictors had significant influence on our dependent variable. Further investigation using stepwise regression informed us that in fact every one of our predictors was useful in predicting median house value. Additionally, to verify that our four predictors were worth keeping, we built a model using only two variables, the percentage of vacant units and median household income; we then ran cross-validation on both the four- and two-predictor models and compared their root mean squared errors.

If we were to bring in new variables it might be interesting to look at the effect that greenspace, specifically tree presence, had on median house value. It’s our assumption that greenspace and trees are often found in areas with higher incomes and in areas that have been greatly influenced by historic red lining practices. Similarly, it would be interesting to incorporate historical red lining boundaries to investigate the effects they have on median house value.

While our model is strong, we must acknowledge that it does not account at all for spatial autocorrelation. We can see in Maps 3 and 4 that the spatial distribution of our standardized residuals appears somewhat clustered. This means our model is missing a variable to account for the spatial influence on median house value; without adding such a variable, our model will never be as accurate as one that does account for it.

Additionally, we must note that our poverty variable, the number of households living below the poverty line, is a raw count. Some might say that poses its own limitation on our model; we are not so sure. It is a fair criticism that the raw number supplies no context for how many people live in the block group, and as a result we lack the information to correctly picture the effect that the density of poverty in a block group has on median house value. Nevertheless, given what our poverty variable represents, i.e., the conditions of real human lives, we think it helpful for making an economic case for addressing poverty. In other words, if we can show, as we did, that a greater number of persons in poverty (no matter the density) negatively affects median house value, that supplies an economic, rather than a moral, argument for addressing poverty. The important link here is that, as we addressed in the introduction, many local governments rely on property taxes to fund their budgets: the higher the value of homes, the greater the property tax base and, possibly, the local government budget. While we strove to apply the most stringent statistical analysis, we must acknowledge that statistics as a field is as much an art as it is a science.